Skip to content

Conversation

benrobby
Copy link

@benrobby benrobby commented Aug 22, 2025

What changes were proposed in this pull request?

Screenshot 2025-08-23 at 02 33 43

Why are the changes needed?

  • it's too easy to miss inadvertent type coercion changes, and some of the existing tables in the codebase have become inconsistent.

Does this PR introduce any user-facing change?

No

How was this patch tested?

  • this is a unit test

Was this patch authored or co-authored using generative AI tooling?

Yes, used claude to generate the table printing utils. Generated-by: claude-sonnet-4

@benrobby benrobby changed the title [SPARK-53355][PYTHON] document python udf type behavior [SPARK-53355][PYTHON][SQL] document python udf type behavior Aug 23, 2025
@benrobby
Copy link
Author

@HyukjinKwon and @asl could you take a look?

Copy link
Contributor

@asl3 asl3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


return [
("byte_values", ByteType(), df([(-128,), (127,), (0,)])),
("byte_null", ByteType(), df([(None,), (42,)])),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we add a comment with context about the null handling importance for input type vs. return type?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you have an example, what comment are you looking for exactly?

Copy link
Contributor

@asl3 asl3 Aug 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optional nit, but i meant we could clarify the context for the return_types and input_types

@HyukjinKwon HyukjinKwon changed the title [SPARK-53355][PYTHON][SQL] document python udf type behavior [SPARK-53355][PYTHON][SQL] Document python udf type behavior Aug 24, 2025
@@ -0,0 +1,5 @@
These tests capture input/output type interfaces between python udfs and the engine.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to run this in CI? If so, we should add __init__.py ,and add this modul into dev/sparktestsupport/modules.py

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, let me try to add it. Do we need any other reviewer for the infra changes?

@HyukjinKwon
Copy link
Member

My only concern is that I don't think we should say this is the standard behaviours ... as some of behaviours are weird. I am fine with this change as long as we're all on the same page that some behaviours here might change in the future.

@benrobby
Copy link
Author

My only concern is that I don't think we should say this is the standard behaviours ... as some of behaviours are weird. I am fine with this change as long as we're all on the same page that some behaviours here might change in the future.

Yes, agreed. I've also updated the readme to reflect that these tests are purely internal and not an API documentation

@benrobby benrobby changed the title [SPARK-53355][PYTHON][SQL] Document python udf type behavior [SPARK-53355][PYTHON][SQL][INFRA] Document python udf type behavior Aug 25, 2025
@benrobby benrobby changed the title [SPARK-53355][PYTHON][SQL][INFRA] Document python udf type behavior [SPARK-53355][PYTHON][SQL][INFRA] test python udf type behavior Aug 25, 2025
@benrobby benrobby requested review from asl3 and HyukjinKwon August 25, 2025 09:23
],
)

pyspark_types = Module(
Copy link
Contributor

@zhengruifeng zhengruifeng Aug 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am kind of hesitant whether it is too much to add a new testing module.
maybe we can just add these tests in pyspark-sql?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong preference. @HyukjinKwon ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pyspark-sql alone should be fine

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alright, I moved the test to run under pyspark-sql

@github-actions github-actions bot removed the INFRA label Aug 26, 2025
@benrobby benrobby changed the title [SPARK-53355][PYTHON][SQL][INFRA] test python udf type behavior [SPARK-53355][PYTHON][SQL] test python udf type behavior Aug 27, 2025
@benrobby
Copy link
Author

@HyukjinKwon this one is ready from my side :)

@HyukjinKwon
Copy link
Member

Merged to master.

@zhengruifeng
Copy link
Contributor

@benrobby the new test fails in two scheduled jobs:
https://github.com/apache/spark/actions/runs/17398781217/job/49386969447
https://github.com/apache/spark/actions/runs/17429965634/job/49486072652

would you mind taking a look? If it is acceptable, then we can skip it in the two envs.
you can check the env in step List Python Packages

@benrobby
Copy link
Author

benrobby commented Sep 5, 2025

@benrobby the new test fails in two scheduled jobs: https://github.com/apache/spark/actions/runs/17398781217/job/49386969447 https://github.com/apache/spark/actions/runs/17429965634/job/49486072652

would you mind taking a look? If it is acceptable, then we can skip it in the two envs. you can check the env in step List Python Packages

Thanks for flagging. This is caused by older numpy versions implementing __repr__ / __string__ differently. So it's not an actual type change, just a test-only issue. I'll adjust the test to align this, will ping you once ready

@benrobby
Copy link
Author

benrobby commented Sep 5, 2025

@zhengruifeng here's the fixup pr #52247 :)

zhengruifeng pushed a commit that referenced this pull request Sep 18, 2025
### What changes were proposed in this pull request?

- this is a minor followup to #52105, we noticed that the test breaks in two spark master runs with a different env
- the root cause was that numpy 1.x implements `__repr__` differently

### Why are the changes needed?

- fix `Build / Python-only (master, Minimum dependencies of PySpark)`

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

ran tests locally with numpy 1.22.4

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #52247 from benrobby/SPARK-53355-fix-numpy-repr.

Authored-by: Ben Hurdelhey <ben.hurdelhey@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants